Google’s Gemini models are currently outperforming rivals in social and strategic games. Google DeepMind, together with Kaggle, has expanded its Game Arena benchmark with two new titles: Werewolf and Poker. The platform is designed to evaluate AI models through competitive games that test different cognitive skills.
Each game targets a distinct capability. Chess measures logical reasoning; Werewolf evaluates social intelligence, including communication, deception detection, and theory of mind; and Poker tests decision-making under uncertainty and incomplete information, along with risk management.
According to the latest results, Gemini 3 Pro and Gemini 3 Flash currently top all leaderboards across the Game Arena benchmarks. The Werewolf benchmark also plays a role in AI safety research, as it allows researchers to assess whether models can detect manipulation and deceptive behavior without exposing them to real-world risks.
Google DeepMind CEO Demis Hassabis said the results highlight the need for more demanding and realistic evaluations of next-generation AI systems, arguing that the industry requires tougher benchmarks to properly assess emerging model capabilities.